SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses

Abstract

Can LLMs consistently improve their previous outputs for better results? For this to be true, LLMs would need to be better at discriminating among previously-generated alternatives than at generating initial responses. We explore the validity of this hypothesis in practice. We first formulate a unified framework that allows us to compare the generative and discriminative capability of any model on any task. In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously-generated alternatives than at generating initial responses. This finding challenges the notion that LLMs may be able to enhance their performance only through their own judgment.
